The RStudio interface is divided into several key areas, each serving a specific purpose. One of the great features of RStudio is that you can customize the layout by reorganizing these windows to suit your workflow.
*Hint:* You can rearrange these windows and tabs by dragging them around the workspace. When you rearrange the panes in RStudio on your computer, the layout stays as you set it across future sessions.
Main components of the RStudio interface:
Code Editor: This is where you write and edit your R scripts.
Console: The console is where R code is executed.
You can type commands directly into the console, and it displays outputs, messages, and errors.
You might prefer to use the console for immediate execution, or testing of small code snippets or commands.
Files/Plots/Packages/Help Pane:
Files: Browse, open, and manage files in your working directory.
Plots: View graphical outputs from your R code, such as plots and graphs.
Packages: Install, update, and load R packages.
Help: Access R documentation and help files.
Task 1.1: Open RStudio and get familiar with the interface by identifying the 4 windows and switching between the tabs.
*Note:* This task is just for you to get comfortable. There is no solution for this task.
Use the code editor if you want to develop more complex, reusable, and maintainable code that can be saved and executed later. We won’t be working in the code editor at this level; it will be introduced at the beginning of the Intermediate level workshop.
The console is a lot like working in Terminal (Mac) or Command Prompt/PowerShell (PC). Each new command line begins with the angle bracket >, also known as the ‘prompt’ symbol.
You will type your commands into the Console after the most recent prompt symbol.
When you are ready to execute (‘run’) the command, press the ‘enter’ or ‘return’ key on your keyboard. The output of the command will appear below your command.
Things to be mindful of:
You cannot execute a command until the previous command has been completely executed.
If you don’t see the prompt symbol, one of two things is happening:
R is still processing your previous command, and you must wait for it to finish.
You might instead see the plus + symbol, which
indicates that you have entered an incomplete command.
If you see the + symbol, you must enter the
remainder of the command before entering a new one.
An error will occur if you write the + symbol into
your command.
Sometimes the output can be extensive and show more information than you expected, e.g., when you load a package (we will discuss packages more in Activity 3).
For all tasks in this workshop, enter your commands in the Console
(bottom left).
Task 1.2: Try getting help! To do this, you’ll run the
help() function. Try getting information on vectors.
#Get additional information about "vectors" (a data type),
help("vector") # then type 'enter' or 'return'
help("vector") will display information about vectors in R. The help information will be displayed in the Help tab (bottom right if you haven’t modified your RStudio layout).
*Note:* You can get help on related content by selecting the dropdown list at the top of the Help tab.
As you work through these activities, remember to save your workspace. To save it, use the top menu bar: File > Save.
Remember: Write all of your code in the Console tab.
*Note:* For the purposes of this workshop, 'variable' and 'data object' are used interchangeably.
To create any data object, the command begins with a name for the new variable, followed by an assignment operator <-, and then the data or expression that defines the content of the variable. This can include direct values, function calls, operations, or other variables.
variableName <- "word"
Definition - “Function”: A set of instructions defined to perform a specific task.
Definition - “Function Call”: The act of executing a
function with specific arguments, if required, to produce a result.
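As a minimal sketch, the four kinds of content a variable can be given (direct value, function call, operation, other variable) might look like this. The variable names are illustrative only and are not part of the workshop data.

```r
# Direct value: store the number 3
count <- 3

# Function call: toupper() converts text to upper case
shout <- toupper("word")

# Operation: add 2 to an existing variable
total <- count + 2

# Other variable: copy the value held in 'total'
copy <- total

shout   # displays "WORD"
copy    # displays 5
```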
Let’s start by looking at types of variables.
Definition - “Basic Data Types”: Types of data representing the simplest forms of data.
Basic Data Types:
Here we’ll look at basic operations with character
variables.
Task 2.1.1: Create a variable for a pig’s first name.
The first pig's first name is 'Bart'.
#assign the first name 'Bart' to the first pig (pig1)
pig1.first_name <- "Bart"
Task 2.1.2: Create a variable for Bart’s last name.
Bart's last name is 'Smith'.
#assign the last name 'Smith' to the first pig (pig1)
pig1.last_name <- "Smith"
Task 2.1.3:
Create a variable that equals Bart’s first and last name, then
display the full name in the console
#concatenate the first pig's (pig1) first ('Bart') and last name ('Smith')
pig1.full_name <- paste(pig1.first_name, pig1.last_name)
#after pig1.full_name has been created, print (display) Bart's full name...
pig1.full_name
## [1] "Bart Smith"
Hint: To combine two strings separated by a space, use the
paste() function.
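If you want a different separator, or none at all, a small sketch like the following may help. It uses the `sep` argument of paste() and the related paste0() function; the variable names `first` and `last` are illustrative.

```r
first <- "Bart"
last <- "Smith"

paste(first, last)              # "Bart Smith" (a space is the default separator)
paste(first, last, sep = "-")   # "Bart-Smith"
paste0(first, last)             # "BartSmith" (no separator)
```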
Now we’ll look at basic operations with numeric and integer variables. First we’ll create height information for Bart and find out how much he’s grown in height.
Task 2.1.4: Create a variable for Bart’s height as a piglet. Bart’s piglet height: 10
#Assign the value of Bart's piglet height
pig1.heightA <- 10
Task 2.1.5: Create a variable for Bart’s height now. Bart’s adult height: 22
#Assign the value of Bart's current height
pig1.heightB <- 22
Task 2.1.6:
Now create a variable expressing the amount he’s grown.
# Find the difference in height using the expression: 'heightB - heightA'
# using the subtraction operator.
pig1.heightGain <- pig1.heightB - pig1.heightA
#after pig1.heightGain has been created, print (display) the value of pig1.heightGain...
pig1.heightGain
## [1] 12
Hint: “Expressing” indicates that the value will require an expression, in this case, a mathematical operation.
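pig1.heightA and pig1.heightB are both stored as the ‘numeric’ data type (which can hold decimal numbers), even though the values entered are whole numbers.
To store a value as the ‘integer’ data type (whole number), add the suffix L to the number, e.g., 10L.
R can perform arithmetic on numeric and integer values alike, such as finding the difference between two values.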
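To check how R has stored a value, you can use the class() function. The sketch below uses illustrative variable names (`heightA`, `heightB`, `heightC`) rather than the workshop variables.

```r
heightA <- 10     # typed without a decimal point, but still stored as numeric
heightB <- 22.5   # a decimal number
heightC <- 10L    # the L suffix asks R to store an integer

class(heightA)    # "numeric"
class(heightB)    # "numeric"
class(heightC)    # "integer"

# Arithmetic works across both types
heightB - heightC # 12.5
```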
Reminder! Save your work
**Additional:** To display all objects you have created, execute the 'list' function in the console: `ls()`. *Note:* 'l' in 'ls' is the lowercase 'L'.
**Additional:** To remove data objects from your environment, execute the 'remove' function in the console: `rm()`.
e.g., rm(full_name)
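A short sketch of creating, listing, and removing an object is shown here. The object `temp.value` is hypothetical and created only for this example, so removing it does not affect your workshop variables.

```r
temp.value <- 42            # hypothetical object, created only for this example
"temp.value" %in% ls()      # TRUE: the object is listed in the environment
rm(temp.value)              # remove it from the environment
exists("temp.value")        # FALSE: the object is gone
```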
Time for logical or boolean values!
We can denote if Bart is small or large with a boolean value.
Task 2.1.7: Create two variables denoting Bart’s general size. Bart can either be ‘mini’ or ‘large’. Note that Bart is a large pig.
pig1.mini <- FALSE
pig1.large <- TRUE
Hint: Boolean values are either ‘TRUE’ or ‘FALSE’ (case sensitive).
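Boolean values do not have to be typed by hand; comparisons also produce them. A minimal sketch, using illustrative variable names:

```r
heightA <- 10
heightB <- 22

has.grown <- heightB > heightA    # TRUE, because 22 is greater than 10
is.same <- heightB == heightA     # FALSE, because the two values differ

class(has.grown)                  # "logical" is R's name for the boolean type
```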
If you have followed the code provided in the activities exactly, the Variables list in your Environment tab should contain the same variables as the ls() output below. If it doesn’t match and you are unsure why, check with the instructor.
Additional: Use the ls() function to see all of the variables in our environment so far.
ls()
## [1] "pig1.first_name" "pig1.full_name" "pig1.heightA" "pig1.heightB"
## [5] "pig1.heightGain" "pig1.large" "pig1.last_name" "pig1.mini"
A vector is a 1-dimensional list of items that are of the same data type (all text, all whole numbers, etc.)
To create a vector object, you will use the c()
function.
The ‘c’ stands for ‘combine’ or ‘concatenate.’
It’s used to create a vector by grouping individual values into a list-like structure.
Think of it as placing items into a container where each item remains distinct and can be individually accessed.
vector1 <- c(val1, val2) creates a vector named ‘vector1’ containing the elements ‘val1’ and ‘val2’ as separate items in a sequence, not as a single merged item.
A value in a vector can be accessed by using square brackets and its index (the value’s place in the vector), where 1 is the first index.
vector1[1] will output: ‘val1’
Note: We also use the term ‘concatenate’ for merging strings (as with paste()). The two uses have different meanings.
As you may have seen when you looked up vectors with the help() function, many functions and operations in R are designed to work naturally with vectors.
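As a small illustration of what "working naturally with vectors" means, an operation applied to a vector is applied to every item in it. The values below are illustrative only.

```r
weights <- c(13.3, 17.2, 14.8)   # illustrative values

weights * 2     # 26.6 34.4 29.6 (each item is doubled)
weights > 14    # FALSE TRUE TRUE (each item is compared with 14)
sum(weights)    # 45.3 (one result computed from all items)
```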
Task 2.2.1: Make a vector for the following weight values of miniature goats. Name your variable ‘goat.weight’
Goat weights: 13.3, 17.2, 14.8, 14.6, 12.4
# The period between 'goat' and 'weight' has no special purpose.
# It only shows the person reading the code that 'weight' is information that pertains to 'goat'
goat.weight <- c(13.3, 17.2, 14.8, 14.6, 12.4)
The command you just ran has now appeared in your console (bottom left window), and the goat.weight vector is now listed in the Environment tab (top right window) under Values.
If at any point you want to view the value of a variable or data associated with a data object, simply enter the variable name and press ‘enter’ or ‘return’ to execute.
Task 2.2.2: Display (aka ‘print’) the contents of the vector containing the goat weights.
goat.weight
## [1] 13.3 17.2 14.8 14.6 12.4
Task 2.2.3: Display the weight of the second goat in the vector.
goat.weight[2]
## [1] 17.2
Hint:
data_object_name[index]
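The index inside the square brackets can also be a vector of positions. The sketch below recreates the goat.weight vector so that it runs on its own.

```r
goat.weight <- c(13.3, 17.2, 14.8, 14.6, 12.4)

goat.weight[c(1, 3)]               # 13.3 14.8 (first and third items)
goat.weight[2:4]                   # 17.2 14.8 14.6 (second through fourth items)
goat.weight[-1]                    # 17.2 14.8 14.6 12.4 (everything except the first)
goat.weight[length(goat.weight)]   # 12.4 (the last item)
```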
You have just worked with numeric vectors. Now let’s move to string vectors.
Task 2.2.4: Make a vector for the following name values
of miniature goats. Name your variable ‘goat.name’
Goat names: baby, pickles, cookie, sparkle, gabbie
*Note:* Text values must be wrapped in quotation marks. You can use double or single quotes, but the opening and closing quotes must match.
- Good: "text"
- Good: 'text'
- Bad: 'text"
goat.name <- c("baby", "pickles", "cookie", "sparkle", "gabbie")
To get the length of a vector, we can use the length()
function.
Task 2.2.5: Print (display) the length of the vector of miniature goat names.
*Note:* In a script (code editor), you often need to use the print() function explicitly to see the output, especially when running multiple lines of code or within functions. However, in the console, R automatically displays the output of expressions upon execution of the command.
length(goat.name)
## [1] 5
A ‘list’ can hold items of different types (even vectors), while items in a ‘vector’ must all be the same type.
To make a list, we’ll use the list() function.
Hint: Remember that all items in a vector must be the same type, but items in a list can be of different types.
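A minimal sketch of a list is shown below. The object name `goat1` and its item names are illustrative assumptions, not part of the workshop tasks. The last two lines show what happens if the same mixed values are placed in a vector instead.

```r
# A list keeps each item's own type
goat1 <- list(name = "baby", weight = 13.3, mini = TRUE)

goat1$name          # "baby"
goat1[["weight"]]   # 13.3
class(goat1$mini)   # "logical"

# A vector converts mixed items to a single type (here, text)
mixed <- c("baby", 13.3, TRUE)
class(mixed)        # "character"
```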
Additional: If you want to create a 2D list, also known as a table, you will create a matrix using the matrix() function. For more on matrices, check me out. Instead of creating our own matrices, we will be importing data later on.
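For reference only, a small sketch of matrix() is given here. The values and the name `m` are illustrative; by default R fills a matrix column by column.

```r
# Six values arranged into 2 rows (and therefore 3 columns)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

m         # column 1 holds 1 and 2, column 2 holds 3 and 4, column 3 holds 5 and 6
dim(m)    # 2 3 (rows, then columns)
m[2, 3]   # 6 (row 2, column 3)
```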
Reminder! Save your work
Statistics is the science of collecting, analyzing, interpreting, and presenting data to uncover patterns and trends, and of making informed decisions based on this data.
If you’re unfamiliar with statistics, you can learn more about it from the W3Schools Statistics Tutorial.
In this section, we’ll be focusing on:
- Basic statistical measures
- Presenting data in a histogram (more on presenting data will be covered in Activity 4-Data Visualization)
- Importing data
The function names for the following three statistical measures (mean, median, standard deviation) are quite intuitive: each is just the name or abbreviation of the measure. Each function takes as its argument the vector containing the set of values we are analyzing.
These three functions are designed for sets of numeric values (whole or decimal numbers). If run on text, the result is not meaningful (for example, mean() returns NA with a warning). On boolean values, mean() and sd() treat TRUE as 1 and FALSE as 0.
For this task, we will use a new vector object containing weights for a set of pigs.
Task 3.1.1:
Create a vector object with the weights of a set of pigs. Name your variable ‘pigs.weight’
Weights of pigs: 22, 27, 19, 25, 12, 22, 18
pigs.weight <- c(22, 27, 19, 25, 12, 22, 18)
Mean: the average value in a set.
This function calculates the sum of the values in the set and divides the sum by the number of items in the set. mean()
Task 3.1.2: Write and execute a command that outputs the mean value of the pigs’ weights
mean(pigs.weight)
## [1] 20.71429
This output is the average weight of all of the pigs
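To see where this number comes from, the same calculation can be sketched by hand with sum() and length(). The names `total` and `count` are illustrative.

```r
pigs.weight <- c(22, 27, 19, 25, 12, 22, 18)

total <- sum(pigs.weight)      # 145, the sum of all weights
count <- length(pigs.weight)   # 7, the number of pigs

total / count                  # gives the same value as mean(pigs.weight)
```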
Median: The middle value in a sorted set
(e.g. lowest - highest). median()
Task 3.1.3:
Write and execute a command that outputs the median value of the pigs’ weights
median(pigs.weight)
## [1] 22
The output tells you the weight of the pig that falls between
the lighter half and the heavier half of the pigs.
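As a rough sketch of what median() does for a set with an odd number of values, you can sort the vector and pick the middle item. The names `sorted` and `middle` are illustrative. For an even number of values, median() instead averages the two middle items.

```r
pigs.weight <- c(22, 27, 19, 25, 12, 22, 18)

sorted <- sort(pigs.weight)            # 12 18 19 22 22 25 27
middle <- (length(sorted) + 1) / 2     # 4, the middle position of 7 items

sorted[middle]                         # 22, the same value as median(pigs.weight)
```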
Standard deviation: Describes how spread out the
data is. sd()
Task 3.1.4:
Write and execute a command that outputs the standard deviation of the pigs’ weights
sd(pigs.weight)
## [1] 4.956958
The output tells you how much the weights of the pigs vary from the average weight. A small standard deviation means that most pigs’ weights are close to the average, indicating uniformity in size. A large standard deviation suggests a wide range of weights.
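For the curious, the calculation behind sd() can be sketched step by step. R's sd() computes the sample standard deviation, which divides by one less than the number of values. The names `deviations` and `variance` are illustrative.

```r
pigs.weight <- c(22, 27, 19, 25, 12, 22, 18)

# How far each weight is from the average weight
deviations <- pigs.weight - mean(pigs.weight)

# Average of the squared distances, dividing by (number of values - 1)
variance <- sum(deviations^2) / (length(pigs.weight) - 1)

# The square root brings the result back to the original units
sqrt(variance)   # gives the same value as sd(pigs.weight)
```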
We can also run a ‘summary’ of our vector of the pigs’ weights to generate several descriptive statistics at the same time.
summary()
Task 3.1.5:
Display a summary of values pertaining to the pigs’ weights
summary(pigs.weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 18.50 22.00 20.71 23.50 27.00
Histogram: A graph used for understanding and analysing the distribution of values in a vector.
hist()
A histogram illustrates:
- Where data points tend to cluster
- The variability of data
- The shape of variability
Task 3.2.1:
Create a histogram for the pigs’ weights.
hist(pigs.weight)
# The histogram will appear in the Plots tab.
The histogram will appear in the Plots tab (bottom right quadrant if you haven’t modified your RStudio layout).
We can also pass in additional parameters to control the way our plot looks.
Some of the frequently used parameters are:
main : The title of the plot, e.g., main = "This is the Plot Title"
xlab : The x-axis label, e.g., xlab = "The X Label"
ylab : The y-axis label
Task 3.2.2:
Create a histogram for the pigs’ weights, with axis labels.
Hint: Remember, a parameter is information that goes in the parentheses of the function.
Single parameter: function_name(parameters)
Multiple parameters:
function_name(parameter1, parameter2)
# The first parameter is the name of the data (vector) object
# 'main' is the graph title
# 'xlab' is the label of the x-axis
# label parameters can be in any order, but should follow the data object
hist(pigs.weight, main = 'Histogram of Pig Weight', xlab = 'Weight')
# The histogram will appear in the Plots tab.
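The ylab parameter listed earlier can be added in the same way. The sketch below also uses `col`, a standard graphical parameter for the bar colour that is not covered in this workshop; the label text and colour are illustrative choices. Storing the result in a variable (here the hypothetical name `h`) lets you inspect the counts behind the bars.

```r
pigs.weight <- c(22, 27, 19, 25, 12, 22, 18)

# Draw the histogram with a title, both axis labels, and coloured bars
h <- hist(pigs.weight,
          main = 'Histogram of Pig Weight',
          xlab = 'Weight',
          ylab = 'Number of pigs',
          col = 'lightblue')

h$counts        # the number of pigs counted in each interval
sum(h$counts)   # 7, one for every pig in the vector
```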
Additional: Use the ls() function to see all of the variables in our environment so far.
ls()
## [1] "goat.name" "goat.weight" "pig1.first_name" "pig1.full_name"
## [5] "pig1.heightA" "pig1.heightB" "pig1.heightGain" "pig1.large"
## [9] "pig1.last_name" "pig1.mini" "pigs.weight"
So far, we’ve created our own objects by manually entering all of the data in the console. In this section, we’ll learn how to create objects by importing (aka ‘reading’) data (compiled outside of R) into R and visualise it with a histogram.
R can handle multiple file types:
We can import data multiple ways. You’ll import here through “File” in the main menu. We’ll look at other ways in the following activity pages.
Task 4.1.1:
Download and save this Excel spreadsheet of Income data.
Note: Please remember where the income.xlsx file is saved (usually in a “downloads” or “desktop” folder).
Task 4.1.3: Import the dataset of Income data
From the top menu bar, select…
File
Import dataset
From Excel
In the ‘Import Excel Data’ window select your file by either:
Entering the file path to the income.xlsx file you just downloaded, or
Selecting “Browse” on the right side of the path bar and locating the file in the file browser.
Under ‘Import Options,’ make sure ‘Name’ is the text you want to use as the variable name. Ours will be ‘income’.
Click “Import”
If you are prompted to install the “readxl” package, select “Yes”.
Note: Don’t worry about making a mistake importing this
data. You can always remove it using the rm()
function.
What you just imported is now stored as a ‘data frame’ object whose
name is income.
Definition - Data frame: essentially a table. It is a 2-dimensional object that can hold different data types.
*Additional:* Data frames contain information about a set of objects (e.g., cats).
- The data frame will contain one or more columns and one or more rows.
- One column contains related values (column 1 = age, column 2 = eye color).
Because the column contains the same type of information, it is equivalent to a vector. I.e., the ‘eye color’ column will contain characters, not numbers.
One row denotes one object from the set. In a data frame of information about a set of cats, each row is information about one specific cat.
A row can contain many different bits of information, like age (numerical), eye color (character), breed (character), whether or not it’s spayed/neutered (boolean). Because rows may contain values of different types, one row would most likely not be a vector. It would likely be a list, which can contain values of different types.
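To make the cat example concrete, here is a minimal sketch that builds a small data frame by hand with the data.frame() function. The values and column names are made up for illustration. In this workshop you will import data rather than type it in.

```r
# Each argument becomes one column; every column is a vector of a single type
cats <- data.frame(
  age = c(2, 5, 9),
  eye_color = c("green", "blue", "amber"),
  spayed_neutered = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keep text columns as plain text
)

cats$eye_color   # one column: a character vector
cats[1, ]        # one row: a one-row data frame holding values of different types
nrow(cats)       # 3 rows, one per cat
```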
To see the data in our data frame, simply enter the name of the data frame in the console and press ‘enter’ or ‘return’.
income
The following will be the output:
## # A tibble: 10 × 4
## id gender income experience
## <dbl> <chr> <dbl> <dbl>
## 1 1 M 23000 3
## 2 2 M 55000 7
## 3 3 M 43000 5
## 4 4 F 37000 5
## 5 5 M 75000 9
## 6 6 M 72000 10
## 7 7 F 121000 13
## 8 8 F 27000 1
## 9 9 F 57000 8
## 10 10 F 91000 10
*Note:* We will explore other ways to view and preview content of our data frames in Activity 3.
*Note:* `<chr>` stands for the "character" data type and `<dbl>` stands for the "double-precision floating point number" data type.
We can see now that our data frame income contains 10 objects (rows) and 4 variables (columns).
- It can be inferred that this data relates to 10 people.
- The values for each person are:
  - id (in lieu of a name) (dbl)
  - gender (chr)
  - income (dbl)
  - experience (dbl)
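The row and column counts can also be checked with functions. So that the sketch below runs without the Excel file, it recreates the data by hand from the printed output above. It is therefore a plain data frame rather than the imported object, and details such as how the id column is stored may differ from your import.

```r
# Recreated by hand from the printed output (an assumption made for this example)
income <- data.frame(
  id = 1:10,
  gender = c("M", "M", "M", "F", "M", "M", "F", "F", "F", "F"),
  income = c(23000, 55000, 43000, 37000, 75000, 72000, 121000, 27000, 57000, 91000),
  experience = c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10),
  stringsAsFactors = FALSE
)

nrow(income)    # 10 rows (people)
ncol(income)    # 4 columns (variables)
names(income)   # "id" "gender" "income" "experience"
str(income)     # a compact overview of each column and its type
```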
Display a summary of statistics for the income data.
summary(income)
## id gender income experience
## Min. : 1.00 Length:10 Min. : 23000 Min. : 1.00
## 1st Qu.: 3.25 Class :character 1st Qu.: 38500 1st Qu.: 5.00
## Median : 5.50 Mode :character Median : 56000 Median : 7.50
## Mean : 5.50 Mean : 60100 Mean : 7.10
## 3rd Qu.: 7.75 3rd Qu.: 74250 3rd Qu.: 9.75
## Max. :10.00 Max. :121000 Max. :13.00
In 3.2 we made a histogram to visualize the distribution of the pig weights. Remember that the parameter that the histogram function takes is a vector.
To extract a vector (column) from our data frame, we will pass in _dataframeName_$_columnName_, where the name of our data frame and the name of the column (a single set of values within that data frame) are separated by the $ symbol.
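A column extracted this way behaves like any other vector, so the functions from section 3 can be applied to it. The sketch below recreates two of the columns by hand from the printed output so that it runs without the Excel file; with the imported data you would skip the first step.

```r
# Recreated by hand from the printed output (an assumption made for this example)
income <- data.frame(
  income = c(23000, 55000, 43000, 37000, 75000, 72000, 121000, 27000, 57000, 91000),
  experience = c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)
)

income$experience         # the 'experience' column as a vector
mean(income$income)       # 60100, matching the Mean in the summary output
max(income$experience)    # 13, matching the Max. in the summary output
```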
Display the vector of data relating to ‘experience’ in a histogram. -
X-label: ‘Experience’ - Title: ‘Histogram of Experience’
#Remember, the generated histogram will appear in the Plots tab.
hist(income$experience, main = 'Histogram of Experience', xlab = 'Experience')
The following will be the output:
We can see in the histogram that there are 7 intervals with equally spaced breaks. In this case, the height of a cell is equal to the number of observations falling in that cell. Why are there 7 intervals? R automatically chooses the number of intervals for you.
Additional: If you would prefer about 4 intervals (i.e., ‘bins’), you can request that using the breaks parameter. R treats a single number as a suggestion and may adjust it so that the break points fall on tidy values.
#breaks suggests the number of intervals
hist(income$experience, main = 'Histogram of Experience', xlab = 'Experience', breaks = 4)
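If you need an exact number of intervals, one option is to give breaks a vector of break points instead of a single number. The sketch below recreates the experience values by hand from the printed output; the break points 0, 4, 8, 12, 16 are an illustrative choice that covers the whole range of the data.

```r
# Recreated by hand from the printed output (an assumption made for this example)
experience <- c(3, 7, 5, 5, 9, 10, 13, 1, 8, 10)

# A single number is only a suggestion; inspect the break points R actually chose
suggested <- hist(experience, breaks = 4, plot = FALSE)
suggested$breaks

# A vector of break points is used exactly as given: four intervals of width 4
exact <- hist(experience,
              breaks = seq(0, 16, by = 4),
              main = 'Histogram of Experience',
              xlab = 'Experience')
exact$counts   # 2 4 3 1
```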